Loading Packages¶
# Reading in, manipulations
import numpy as np
import pandas as pd
# Plotting
import plotly_express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
# Clustering
from sklearn.preprocessing import PowerTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Other
import warnings
warnings.filterwarnings(action = "ignore")
Setting my theme and color scheme
pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"
custom_color_scale = ['#0000FF', '#33CCFF', '#66FF99', '#D2D200', '#FFFF00']
Reading data
df = pd.read_csv("data/cc_general.csv")
EDA¶
df.head()
| CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | NaN | 0.000000 | 12 |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
Making columns lowercase
df.columns = df.columns.str.lower()
Checking for NAs
df.isnull().sum().sort_values(ascending = False)
minimum_payments                    313
credit_limit                          1
cust_id                               0
balance                               0
prc_full_payment                      0
payments                              0
purchases_trx                         0
cash_advance_trx                      0
cash_advance_frequency                0
purchases_installments_frequency      0
oneoff_purchases_frequency            0
purchases_frequency                   0
cash_advance                          0
installments_purchases                0
oneoff_purchases                      0
purchases                             0
balance_frequency                     0
tenure                                0
dtype: int64
We've got some missing values; we'll deal with them later. Let's check the distributions ...
df.describe()
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8949.000000 | 8950.000000 | 8637.000000 | 8950.000000 | 8950.000000 |
| mean | 1564.474828 | 0.877271 | 1003.204834 | 592.437371 | 411.067645 | 978.871112 | 0.490351 | 0.202458 | 0.364437 | 0.135144 | 3.248827 | 14.709832 | 4494.449450 | 1733.143852 | 864.206542 | 0.153715 | 11.517318 |
| std | 2081.531879 | 0.236904 | 2136.634782 | 1659.887917 | 904.338115 | 2097.163877 | 0.401371 | 0.298336 | 0.397448 | 0.200121 | 6.824647 | 24.857649 | 3638.815725 | 2895.063757 | 2372.446607 | 0.292499 | 1.338331 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 0.019163 | 0.000000 | 6.000000 |
| 25% | 128.281915 | 0.888889 | 39.635000 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1600.000000 | 383.276166 | 169.123707 | 0.000000 | 12.000000 |
| 50% | 873.385231 | 1.000000 | 361.280000 | 38.000000 | 89.000000 | 0.000000 | 0.500000 | 0.083333 | 0.166667 | 0.000000 | 0.000000 | 7.000000 | 3000.000000 | 856.901546 | 312.343947 | 0.000000 | 12.000000 |
| 75% | 2054.140036 | 1.000000 | 1110.130000 | 577.405000 | 468.637500 | 1113.821139 | 0.916667 | 0.300000 | 0.750000 | 0.222222 | 4.000000 | 17.000000 | 6500.000000 | 1901.134317 | 825.485459 | 0.142857 | 12.000000 |
| max | 19043.138560 | 1.000000 | 49039.570000 | 40761.250000 | 22500.000000 | 47137.211760 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 123.000000 | 358.000000 | 30000.000000 | 50721.483360 | 76406.207520 | 1.000000 | 12.000000 |
There are quite a few outliers!
Removing the customer ID, as we will not need it.
df.drop(['cust_id'], axis = 1, inplace = True)
fig = px.box(df)
fig.update_layout(
title_text = 'Not only quite a few outliers, but the ranges also differ across variables.'
)
fig.show()
We will need to standardize the data later for clustering.
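To see why standardization matters for clustering: KMeans uses Euclidean distances, which large-scale features like balance would otherwise dominate. A minimal sketch with made-up numbers (not from this dataset):

```python
import numpy as np

# Two hypothetical customers: very different in a 0-1 frequency feature,
# only slightly different in a dollar-scale balance feature.
a = np.array([0.1, 5000.0])   # [frequency-like feature, balance-like feature]
b = np.array([0.9, 5100.0])

# Raw distance: the balance difference drowns out the frequency difference.
raw_dist = np.linalg.norm(a - b)

# After dividing each feature by an (illustrative) standard deviation,
# both differences contribute on comparable scales.
stds = np.array([0.4, 2000.0])
scaled_dist = np.linalg.norm((a - b) / stds)

print(raw_dist, scaled_dist)
```

On the raw values the distance is essentially just the balance gap; after scaling, the frequency gap actually contributes more.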
nr_columns = len(df.columns)
nr_rows = (nr_columns + 1) // 2
nr_cols = 2
fig = make_subplots(rows = nr_rows, cols = nr_cols, subplot_titles = df.columns)
for i, col in enumerate(df.columns):
row_num = (i // nr_cols) + 1
col_num = (i % nr_cols) + 1
fig.add_trace(go.Histogram(x = df[col], nbinsx = 30, name = str(col)), row = row_num, col = col_num)
fig.update_layout(height = 900, width = 600, showlegend = False)
fig.update_xaxes(tickfont = dict(size = 8))
fig.update_yaxes(tickfont = dict(size = 8))
fig.show()
df[df['purchases'] == 0]
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3202.467416 | 0.909091 | 0.0 | 0.0 | 0.0 | 6442.945483 | 0.0 | 0.0 | 0.0 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
| 14 | 2772.772734 | 1.000000 | 0.0 | 0.0 | 0.0 | 346.811390 | 0.0 | 0.0 | 0.0 | 0.083333 | 1 | 0 | 3000.0 | 805.647974 | 989.962866 | 0.000000 | 12 |
| 16 | 2072.074354 | 0.875000 | 0.0 | 0.0 | 0.0 | 2784.274703 | 0.0 | 0.0 | 0.0 | 0.250000 | 3 | 0 | 3000.0 | 391.974562 | 376.579631 | 0.000000 | 8 |
| 24 | 5368.571219 | 1.000000 | 0.0 | 0.0 | 0.0 | 798.949863 | 0.0 | 0.0 | 0.0 | 0.363636 | 4 | 0 | 6000.0 | 1422.726707 | 1657.002877 | 0.000000 | 11 |
| 35 | 1656.350781 | 1.000000 | 0.0 | 0.0 | 0.0 | 99.264367 | 0.0 | 0.0 | 0.0 | 0.083333 | 1 | 0 | 1800.0 | 351.500199 | 634.569311 | 0.000000 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8920 | 1055.087681 | 0.666667 | 0.0 | 0.0 | 0.0 | 1820.116200 | 0.0 | 0.0 | 0.0 | 0.333333 | 2 | 0 | 2000.0 | 54.795084 | 233.229870 | 0.000000 | 6 |
| 8929 | 371.527312 | 0.333333 | 0.0 | 0.0 | 0.0 | 1465.407927 | 0.0 | 0.0 | 0.0 | 0.166667 | 5 | 0 | 1500.0 | 0.000000 | NaN | 0.000000 | 6 |
| 8937 | 163.001629 | 0.666667 | 0.0 | 0.0 | 0.0 | 274.440466 | 0.0 | 0.0 | 0.0 | 0.166667 | 3 | 0 | 1000.0 | 117.738787 | 60.792450 | 0.000000 | 6 |
| 8938 | 78.818407 | 0.500000 | 0.0 | 0.0 | 0.0 | 1113.186078 | 0.0 | 0.0 | 0.0 | 0.166667 | 7 | 0 | 1200.0 | 1397.770131 | 21.821194 | 0.333333 | 6 |
| 8948 | 13.457564 | 0.833333 | 0.0 | 0.0 | 0.0 | 36.558778 | 0.0 | 0.0 | 0.0 | 0.166667 | 2 | 0 | 500.0 | 52.549959 | 55.755628 | 0.250000 | 6 |
2044 rows × 17 columns
- We've got customers that made no purchases, yet their balance was still affected, due to cash advances. It could also be because of fees or interest.
- Also, the balance does not match up if we add together payments and cash advances, so the balance may reflect the previous billing cycle.
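One way to quantify the "partly cash advances" observation is the share of zero-purchase customers that took at least one cash advance. A sketch on a tiny made-up frame (in the notebook, the same two lines would run on `df`):

```python
import pandas as pd

# Tiny illustrative frame; columns mirror the dataset's lowercase schema.
toy = pd.DataFrame({
    'purchases':    [0.0, 0.0, 150.0, 0.0],
    'cash_advance': [640.0, 0.0, 0.0, 35.0],
})

# Filter to zero-purchase rows, then take the fraction with any cash advance.
no_purchases = toy[toy['purchases'] == 0]
share_with_advances = (no_purchases['cash_advance'] > 0).mean()
print(share_with_advances)
```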
OK, let's start by looking at the most important variable when it comes to credit cards: the credit limit.
fig = px.histogram(df, x = 'credit_limit', nbins = 100)
fig.show()
As expected, most limits are low.
fig = px.scatter(df, x = 'credit_limit', y = 'payments')
fig.show()
So we can see that ...
- we've got a variety of purchase behaviors, and ...
- a higher credit limit does not imply higher payment amounts.
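A quick way to sanity-check that second claim is the Pearson correlation between the two columns. Illustrative values below; in the notebook you would pass `df['credit_limit']` and `df['payments']`:

```python
import numpy as np

# Made-up limits and payments for six hypothetical customers.
credit_limit = np.array([1000, 7000, 7500, 1200, 3000, 6500], dtype=float)
payments = np.array([200, 4100, 620, 680, 850, 1900], dtype=float)

# Pearson correlation: an |r| well below 1 means limit alone
# is a weak predictor of payment amounts.
r = np.corrcoef(credit_limit, payments)[0, 1]
print(round(r, 2))
```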
fig = px.scatter(df, x = 'purchases', y = 'payments', color = 'cash_advance', trendline = 'ols')
fig.show()
- But the more purchases, the more likely the payments are high.
- As noticed before, quite a few customers make no purchases; cash advances explain part of this, and the rest cannot be explained from this dataset.
It's still a bit fuzzy; let's move on to correlations ...
correlation_matrix = df.corr()
fig = go.Figure(
data = go.Heatmap(
x = correlation_matrix.columns,
y = correlation_matrix.index,
z = correlation_matrix.values,
colorscale = 'YlGnBu')
)
fig.show()
Variables that directly influence each other are, of course, highly correlated, such as ...
- ... credit limit and balance
- ... purchases frequency and one-off purchases frequency
- ... purchases and installments purchases
- ... etc.
That means quite a few variables probably will not be needed to cluster our customers.
To really understand where most of the variation comes from, I will use PCA. But before that, our features need some pre-processing!
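The heatmap has to be read by eye; a sketch of how to pull the strongly correlated pairs out programmatically (shown on a tiny synthetic frame; in the notebook you would start from `correlation_matrix` directly):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two strongly related columns and one unrelated one.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'purchases': base,
    'purchases_trx': base + rng.normal(scale=0.2, size=500),
    'tenure': rng.normal(size=500),
})
corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then filter.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])
```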
Pre-processing¶
- Imputing missing values - we have a right-skewed distribution, so we will use the median to impute the missing numbers.
- Dropping the one row that has a missing value in credit limit.
fig = px.histogram(df, x = 'minimum_payments', nbins = 50)
fig.show()
df.dropna(subset = ['credit_limit'], inplace = True)
df['minimum_payments'] = df['minimum_payments'].fillna(df['minimum_payments'].median())
print(df.isnull().sum())
balance                             0
balance_frequency                   0
purchases                           0
oneoff_purchases                    0
installments_purchases              0
cash_advance                        0
purchases_frequency                 0
oneoff_purchases_frequency          0
purchases_installments_frequency    0
cash_advance_frequency              0
cash_advance_trx                    0
purchases_trx                       0
credit_limit                        0
payments                            0
minimum_payments                    0
prc_full_payment                    0
tenure                              0
dtype: int64
As we have observed, most variables are skewed, so we need to ...
- Normalize the variables with PowerTransformer
- Store it separately for clustering
P.S. Using PowerTransformer for step (1) since we have a variety of distributions (right-skewed, left-skewed, and bimodal), and the features have significantly different scales and variances.
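To illustrate what PowerTransformer does before applying it to the real features, a small sketch on a synthetic right-skewed (log-normal) sample, similar in shape to balance or purchases:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# A strongly right-skewed synthetic feature.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 1))

# Yeo-Johnson by default, with zero-mean / unit-variance standardization.
pt = PowerTransformer()
x_t = pt.fit_transform(x)

print(round(skew(x.ravel()), 2))    # large positive skew before
print(round(skew(x_t.ravel()), 2))  # close to 0 after transforming
```

The transformed feature ends up roughly symmetric with mean ~0 and standard deviation ~1, which is exactly what KMeans wants.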
scaler = PowerTransformer()
X = scaler.fit_transform(df)
np.allclose(np.mean(X), 0) and np.allclose(np.std(X), 1)
True
A superfluous step, but just so the reader knows that the data is clearly normalized. Finally, it's time for PCA!
PCA¶
- Use all variables
- Check the highest variance explained
- Plot loadings
pca = PCA(n_components = None)
X_pca = pca.fit_transform(X)
# NB: the columns below are principal components, not the original features;
# the feature names are reused here only as convenient labels.
X_pca_df = pd.DataFrame(X_pca, columns = df.columns)
print(X_pca_df)
balance balance_frequency purchases oneoff_purchases \
0 -0.256402 -3.117240 -0.212140 -1.039162
1 -3.480136 0.875292 -0.180773 1.335073
2 1.337092 1.187997 2.191266 -1.742006
3 -1.373930 -0.533114 2.053472 -0.580049
4 -1.009702 -1.406326 1.019393 -1.981967
... ... ... ... ...
8944 1.863231 -3.006702 -1.740143 1.166673
8945 1.296106 -2.132573 -1.919821 -0.560237
8946 1.282313 -3.657642 -1.268103 1.104514
8947 -2.809793 -3.916708 0.107539 1.826599
8948 -0.347167 -0.562417 2.151005 1.528104
installments_purchases cash_advance purchases_frequency \
0 0.060030 0.155493 -0.679016
1 2.105084 -0.295592 0.679257
2 0.009938 -0.361336 0.290415
3 -0.582991 -0.741878 -2.211714
4 0.287524 0.600037 0.931382
... ... ... ...
8944 -1.596956 -0.006751 2.150340
8945 -2.341109 -0.883149 1.208490
8946 -1.827778 -0.695729 1.084660
8947 -1.963909 0.650321 1.174312
8948 -3.911358 0.124713 -0.796946
oneoff_purchases_frequency purchases_installments_frequency \
0 0.873231 0.303741
1 0.268776 -0.647957
2 -0.722455 -0.272510
3 -0.887097 -1.522410
4 0.426215 0.624754
... ... ...
8944 -0.779580 0.499237
8945 0.028846 0.568643
8946 -0.165126 -0.362678
8947 -0.233841 -0.353835
8948 -0.029382 -0.248305
cash_advance_frequency cash_advance_trx purchases_trx credit_limit \
0 -0.386776 0.164891 0.089532 0.632001
1 0.133657 -0.360332 -0.357819 -0.154056
2 1.218358 0.169826 -0.495254 0.046993
3 -1.112262 0.836977 -0.973376 -0.912798
4 -0.056678 -0.291900 -0.165492 -0.003175
... ... ... ... ...
8944 0.013688 0.100195 -0.079311 -0.261552
8945 0.481943 0.226928 1.089301 -0.119652
8946 0.169422 -0.104195 -0.238034 -0.237678
8947 -0.300750 -0.121237 -0.046715 0.186797
8948 0.697003 0.623269 -0.723173 0.390070
payments minimum_payments prc_full_payment tenure
0 0.106505 -0.178227 -0.133516 -0.023747
1 0.055915 -0.001065 0.002851 -0.136091
2 -0.083213 -0.499905 -0.251716 -0.011544
3 0.127374 -0.273642 0.196251 -0.074892
4 -0.107770 0.108535 -0.018660 0.011435
... ... ... ... ...
8944 -0.032893 -0.253187 -0.032960 0.057837
8945 0.107551 -0.306901 0.000984 0.031057
8946 -0.106563 -0.164333 -0.033685 0.040150
8947 -0.302203 0.136741 0.030968 0.067428
8948 -0.371654 0.177702 -0.021312 -0.365567
[8949 rows x 17 columns]
fig = go.Figure()
fig.add_trace(go.Scatter(
x = list(range(1, len(pca.explained_variance_) + 1)),
y = pca.explained_variance_,
marker = dict(color = 'blue')
))
fig.update_layout(
title = "The bend occurs at eigenvalue number 3.",
xaxis = dict(title = "Eigenvalue number"),
yaxis = dict(title = "Eigenvalue size"),
)
fig.show()
That suggests the first 3 components could be enough. Let's look at the explained variance.
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
fig = px.area(
x = range(1, exp_var_cumul.shape[0] + 1),
y = exp_var_cumul,
labels = {"x": "# Components", "y": "Explained Variance"}
)
fig.show()
At around 5 components we retain 80% of the explained variance.
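Instead of reading the 80% point off the chart, it can be computed directly from the cumulative curve. A self-contained sketch with synthetic data (in the notebook, the last two lines would run on `exp_var_cumul`):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data standing in for the transformed features X:
# 10 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(7)
latent = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + rng.normal(scale=0.3, size=(1000, 10))

pca_demo = PCA().fit(X_demo)
cumul = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components reaching 80% explained variance.
n_80 = int(np.argmax(cumul >= 0.8)) + 1
print(n_80)
```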
features = df.columns
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
fig = px.scatter(X_pca, x = 0, y = 1)
for i, feature in enumerate(features):
fig.add_annotation(
ax = 0, ay = 0,
axref="x", ayref="y",
x = loadings[i, 0],
y = loadings[i, 1],
arrowsize = 0.5,
arrowhead = 1,
xanchor = "right",
yanchor = "top")
fig.add_annotation(
x=loadings[i, 0],
y=loadings[i, 1],
ax=0, ay=0,
text = f"{feature} (PC1): {loadings[i, 0]:.2f} and (PC2): {loadings[i, 1]:.2f}",
yshift=-15
)
fig.show()
- We can notice one clear cluster on the left side separate from the big cluster.
- The features that contribute the most to PC1 are balance and minimum payments, whereas for PC2 (the y-axis) the largest contributors are a mix of purchase, installment and cash-advance features.
- Variables that sit close together, such as purchases & purchases_trx or oneoff_purchases_frequency & oneoff_purchases, are positively correlated.
I will use 3 PCA components as features for the KMeans, as they explain enough of the variance without adding too much complexity.
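The "which features drive PC1" reading above can also be done numerically by ranking absolute loadings. A sketch with fake data and illustrative feature names (the recipe, not the notebook's actual values, is the point):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Fake data where two features deliberately share a direction of variance.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(500, 4))
X_demo[:, 1] += 2 * X_demo[:, 0]
features = ['balance', 'minimum_payments', 'purchases', 'tenure']

pca_demo = PCA().fit(X_demo)

# Loadings: components scaled by the square root of their variances,
# same formula as used for the biplot above.
loadings = pd.DataFrame(
    pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_),
    index=features,
    columns=[f'PC_{i+1}' for i in range(4)],
)

# Features ranked by absolute contribution to PC1.
print(loadings['PC_1'].abs().sort_values(ascending=False))
```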
pca = PCA(n_components = 3)
X_pca = pca.fit_transform(X)
K-Means¶
kmeans_models = [KMeans(n_clusters = k, random_state = 1).fit(X_pca) for k in range(1, 10)]
inertia = [model.inertia_ for model in kmeans_models]
fig = go.Figure()
fig.add_trace(go.Scatter(
x = list(range(1, len(inertia) + 1)),
y = inertia,
mode = 'lines+markers',
marker = dict(size = 10), line = dict(color = 'blue'))
)
fig.update_layout(title = 'The elbow is around 5 clusters.',
xaxis = dict(title = 'Number of Clusters'),
yaxis = dict(title = 'WCSS'))
fig.show()
The inertia drop is minimal after K = 5, therefore 5 can be considered the optimal number of clusters.
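"The drop is minimal" can be made concrete by computing the relative inertia decrease per added cluster. A sketch with an illustrative inertia curve (not the notebook's actual values):

```python
# Hypothetical WCSS values for k = 1..9, shaped like a typical elbow curve.
inertia_demo = [152000, 101000, 74000, 58000, 47000, 44000, 42000, 40500, 39500]

# Relative drop when moving from k-1 to k clusters.
drops = [
    (prev - cur) / prev
    for prev, cur in zip(inertia_demo, inertia_demo[1:])
]
for k, d in enumerate(drops, start=2):
    print(f"k={k}: inertia falls by {d:.1%}")
```

With a curve like this, the per-cluster gain collapses right after k = 5, which is the numeric version of "the elbow is around 5".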
from sklearn.metrics import silhouette_score
silhouette_scores = [silhouette_score(X_pca, model.labels_) for model in kmeans_models[1:5]]
for k, score in enumerate(silhouette_scores, start = 2):
print(f"Number of clusters: {k}")
print(f"Silhouette score: {score}")
print()
Number of clusters: 2
Silhouette score: 0.37718202525424227

Number of clusters: 3
Silhouette score: 0.36695690303122364

Number of clusters: 4
Silhouette score: 0.3860108281916691

Number of clusters: 5
Silhouette score: 0.40085459141940327
We'll move on with 5 clusters. Let's see how it looks.
kmeans = KMeans(n_clusters = 5, random_state = 23)
kmeans.fit(X_pca)
KMeans(n_clusters=5, random_state=23)
Adding the clusters into the dataframe ...
df_pca_kmeans = pd.concat([df.reset_index(drop = True), pd.DataFrame(X_pca)], axis = 1)
df_pca_kmeans.columns.values[-3: ] = ['PC_1', 'PC_2', 'PC_3']
df_pca_kmeans = df_pca_kmeans.reset_index(drop = True)
df_pca_kmeans['cluster_id'] = kmeans.labels_
cluster_names = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
df_pca_kmeans['cluster'] = df_pca_kmeans['cluster_id'].map(cluster_names)
x_axis = df_pca_kmeans['PC_1']
y_axis = df_pca_kmeans['PC_2']
fig = px.scatter(x = x_axis, y = y_axis, color = df_pca_kmeans['cluster'], color_continuous_scale = custom_color_scale)
fig.show()
The model is not perfect, but unfortunately the points do not form super clear clusters, as there is quite a lot of variance.
Introducing the third component ...
z_axis = df_pca_kmeans['PC_3']
fig = px.scatter_3d(x = x_axis, y = y_axis, z = z_axis, color = df_pca_kmeans['cluster'], color_continuous_scale = custom_color_scale)
fig.show()
Now, let's go back to the original data and see what that actually means.
fig = px.box(df_pca_kmeans, x = 'cluster', y = 'credit_limit')
fig.update_layout(
title_text = 'Credit limit is higher for clusters 1 and 2.'
)
fig.show()
We can see that the first two clusters are the ones with the higher limits. Would we assume they also spend more?
fig = px.scatter(df_pca_kmeans, x = 'credit_limit', y = 'purchases', color = 'cluster', color_continuous_scale = custom_color_scale)
fig.update_layout(
title_text = 'Clusters 4 and 5 are not big purchasers.'
)
fig.show()
Yes, that is definitely one additional differentiator. The lower part contains customers that do not make a lot of purchases - clusters 4 and 5.
fig = px.scatter(df_pca_kmeans, x = 'credit_limit', y = 'balance', color = 'cluster', color_continuous_scale = custom_color_scale)
fig.update_layout(
title_text = '... especially Cluster 4 of customers does not use it for much.'
)
fig.show()
Now we can see that the last two clusters (4, 5) are customers that do not generate much business. But for the rest, this is not a clear differentiator.
Where does the difference come from then? Let's check cash advances, as we have already seen there were quite a few.
fig = px.box(df_pca_kmeans, x = 'cluster', y = 'cash_advance', log_y = True)
fig.update_layout(
title_text = '... Cash advances are a favourite for clusters 1 and 3.'
)
fig.show()
Here is where cluster 3 comes in. These are the customers who have a decent limit but use it mostly for cash advances.
Let's go back to our PCA view, and observe the impact by certain variables.
x_axis = df_pca_kmeans['PC_1']
y_axis = df_pca_kmeans['PC_2']
fig = px.scatter(x = x_axis, y = y_axis, color = df_pca_kmeans['credit_limit'], color_continuous_scale = custom_color_scale)
fig.show()
Now, if we recall our clusters, we can already see that the lower part corresponds mostly to low credit limits, and the upper part of the y-axis to higher limits.
But this still explains only a small part of the picture.
This variable mostly separates the lower cluster, the 5th - the low-limit customers that do not generate much business.
fig = px.scatter(x = x_axis, y = y_axis, color = df_pca_kmeans['purchases_frequency'], color_continuous_scale = custom_color_scale)
fig.show()
Now we are getting somewhere.
- You see how the cluster on the left side is clearly far away from all the others? This is the no- or almost-no-usage group.
- Then on the other side we've got the high purchase group.
df_cluster_metrics = df_pca_kmeans.groupby('cluster').describe().T.reset_index().rename(columns = {'level_0': 'Column', 'level_1': 'Metrics'})
df_cluster_profile = df_cluster_metrics[df_cluster_metrics['Metrics'] == '50%'].set_index('Column').reset_index()
df_cluster_profile.style.background_gradient(cmap = 'YlGnBu', axis = 1).hide(axis = 'index')
| Column | Metrics | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| balance | 50% | 2705.591995 | 756.857540 | 1460.077343 | 164.634007 | 61.115414 |
| balance_frequency | 50% | 1.000000 | 1.000000 | 1.000000 | 0.857143 | 1.000000 |
| purchases | 50% | 824.770000 | 2077.825000 | 0.000000 | 368.405000 | 350.000000 |
| oneoff_purchases | 50% | 375.250000 | 1132.320000 | 0.000000 | 324.830000 | 0.000000 |
| installments_purchases | 50% | 289.250000 | 669.750000 | 0.000000 | 0.000000 | 335.620000 |
| cash_advance | 50% | 1574.545965 | 0.000000 | 1241.559904 | 0.000000 | 0.000000 |
| purchases_frequency | 50% | 0.750000 | 1.000000 | 0.000000 | 0.250000 | 0.833333 |
| oneoff_purchases_frequency | 50% | 0.200000 | 0.583333 | 0.000000 | 0.166667 | 0.000000 |
| purchases_installments_frequency | 50% | 0.500000 | 0.750000 | 0.000000 | 0.000000 | 0.750000 |
| cash_advance_frequency | 50% | 0.250000 | 0.000000 | 0.250000 | 0.000000 | 0.000000 |
| cash_advance_trx | 50% | 5.000000 | 0.000000 | 4.000000 | 0.000000 | 0.000000 |
| purchases_trx | 50% | 13.000000 | 29.000000 | 0.000000 | 4.000000 | 10.000000 |
| credit_limit | 50% | 5000.000000 | 6000.000000 | 3000.000000 | 3000.000000 | 2000.000000 |
| payments | 50% | 1664.863771 | 1866.978859 | 729.064946 | 533.568794 | 422.709468 |
| minimum_payments | 50% | 1026.828021 | 248.785651 | 475.102035 | 183.180968 | 165.163582 |
| prc_full_payment | 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.142857 |
| tenure | 50% | 12.000000 | 12.000000 | 12.000000 | 12.000000 | 12.000000 |
| PC_1 | 50% | -0.014990 | 3.043762 | -3.468358 | 0.130777 | 1.657189 |
| PC_2 | 50% | 2.550159 | 0.888861 | 0.017824 | -1.102028 | -1.875139 |
| PC_3 | 50% | -0.383794 | 0.451095 | -0.271980 | 1.958574 | -1.382261 |
| cluster_id | 50% | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 4.000000 |
Cluster 1 (Active Cash Advance Customers) They have a higher limit than most and they use it for purchasing. Additionally, they make use of installments and some cash advances.
Cluster 2 (All-in Active Customers) They have a higher limit than Cluster 1; moreover, they purchase even more frequently and make heavier use of installments.
Cluster 3 (Cash Advance Customers) They have a lower limit and use a higher proportion of it for cash advances than others.
Cluster 4 (Dead Customers) They do not do much. They do not buy, and they do not take cash advances.
Cluster 5 (Installment Customers) They prefer safety and therefore to make purchases via installments, and do not take any cash advance.
So what kind of marketing could the bank target depending on the customer?
Marketing Strategies¶
Cluster 1 (Active Cash Advance Customers):
- How: Provide discounts or rewards for specific purchase categories.
- Why: Maintain the interest.
Cluster 2 (All-in Active Customers):
- How: Introduce exclusive rewards for high-frequency transactions.
- Why: This group has the money and likes to spend it.
Cluster 3 (Cash Advance Customers):
- How: Explore other credit card types e.g. low cash advance but low fees and higher tenure.
- Why: Too frequent cash advances are risky for a bank. In the best case, this group of customers would decrease in size and move to the Active Customer group.
Cluster 4 (Dead Customers):
- How: Consider contacting customers with surveys to understand their disengagement reasons.
- Why: This group would need to be understood first before targeting with a marketing technique. Why don't they purchase? Are these people with lower income, are they young?
Cluster 5 (Installment Customers):
- How: Develop partnerships with merchants offering installment-friendly services and potentially that are low in interest.
- Why: It is scalable across other clusters e.g. dead customers.